Variation and selection are the core principles of Darwinian evolution, yet quantitatively relating
the diversity of a population to its capacity to respond to selection is challenging. Here, we examine
this problem at a molecular level in the context of populations of partially randomized proteins
selected for binding to well-defined targets. We built several minimal protein libraries, screened
them in vitro by phage display and analyzed their response to selection by high-throughput
sequencing. A statistical analysis of the results reveals two main findings: first, libraries with the same
sequence diversity but built around different “frameworks” typically have vastly different responses;
second, the distribution of responses within a library follows a simple scaling law. We show how
an elementary probabilistic model based on extreme value theory rationalizes these findings. Our
results have implications for designing synthetic protein libraries, for estimating the density of
functional biomolecules in sequence space, for characterizing diversity in natural populations and for
experimentally investigating the concept of evolvability, or potential for future evolution.


Diversity is the fuel of evolution by natural selection but translating this concept into
quantitative measurements is not straightforward [1]. A simple count of the number of different
individuals in a population, for instance, fails to account for the very different responses to
selection that two populations with the same number of different individuals may elicit. The
problem is already acute at the molecular scale where it also takes a very practical form:
libraries of diverse proteins are routinely screened as a way to identify biomolecules of interest
(binders, catalysts, ...) and a proper “diversity” is critical for success [2, 3]. But beyond a
general agreement that maximizing the number of different elements is desirable, there is no
general rule for engineering and comparing diversity in these libraries.

A common design of many protein libraries is to concentrate variations at one or a few variable
parts located around a fixed “framework”, which is shared by all members of the library [2, 3].
The natural design of antibody repertoires, the pools of immune proteins with potential to
recognize nearly every molecular target, follows this pattern. Most sequence variations in
antibodies are indeed concentrated at a few loops extending from a common structural scaffold [4].
This design has inspired the conception of artificial protein libraries built on frameworks other
than the immunoglobulin fold [5].

Here, we propose a new approach to quantitatively characterize the selective potential of
molecular libraries. To develop this approach, we designed and screened 24 synthetic protein
libraries with identical sequence variations but different frameworks and analyzed their response
to well-defined selective pressures by high-throughput sequencing. Between libraries, we find that
selective potentials vary widely and define a hierarchy of frameworks. Within libraries, we find
that selective potentials exhibit a simple scaling law, characterized by few parameters. The
essence of these results is captured by an elementary probabilistic model based on extreme value
theory (EVT). This leads us to propose a new measure of the selective potential of a population
that overcomes the shortcomings of previously proposed measures of diversity.


I. METHODS 
A. Library design 

We built 24 minimal libraries with different frameworks but identical sequence diversity
(Materials & Methods, Figures 1 and S1). Twenty frameworks consist of single-domain antibodies
taken from natural heavy-chain genes of diverse origins ($V_H$ fragments), typically sharing 40%
of their amino acids (Figure S2); they originate from maturated antibodies, which are mutated
relative to their germline form, except for the SI framework, which comes from a germline (naive)
antibody. Three additional frameworks are more closely related and correspond to the germline and
two maturated forms of the same human antibody, with the maturated frameworks sharing 65% and 85%
sequence identity with the germline. Finally, one framework consists exclusively of glycines to
serve as a control. Diversity is limited to four consecutive amino acids at the complementarity
determining region 3 (CDR3), the part of antibody sequences most critical for specificity [6].
Structurally, the CDR3 forms one of three loops that define the binding pocket of a $V_H$
domain [4]; in our design, the two other loops (CDR1 and CDR2) are thus part of the framework. Our
libraries are minimal on two accounts: the framework consists of a single domain of about 100
amino acids and the total diversity is $20^4 = 1.6 \times 10^5$, i.e., all combinations of the 20
natural amino acids at the 4 varied sites. For comparison, the most commonly used antibody
libraries consist of two domains ($V_H$ and $V_L$) and have >10^ variants, with variation
introduced at different CDRs [7]. Libraries based on $V_H$ only are, however, known to be
effective [8]. “Minimalist libraries” have also been built by restricting the alphabet of amino
acids at the variable sites but contained > 10^^ variants [9-11]. One of the simplest libraries
demonstrated so far, built on a synthetic scaffold, still contained >10^ variants randomly sampled
from a much larger pool of potential sequences [12].
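
As a small illustration of the library size quoted above, the following sketch enumerates every possible 4-residue variable segment. The alphabet ordering and the names `AMINO_ACIDS`, `N_SITES` and `enumerate_cdr3` are choices made for this example only and do not come from the original work.

```python
from itertools import product

# One-letter codes of the 20 natural amino acids (ordering is arbitrary here).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_SITES = 4  # number of randomized CDR3 positions, as stated in the text


def enumerate_cdr3(n_sites=N_SITES, alphabet=AMINO_ACIDS):
    """Yield every possible variable segment as a string."""
    for combo in product(alphabet, repeat=n_sites):
        yield "".join(combo)


library = list(enumerate_cdr3())
print(len(library))             # 160000, i.e. 20**4 = 1.6e5
print(library[0], library[-1])  # AAAA YYYY
```

With 24 frameworks sharing this same variable segment, a mixture of all libraries would contain 24 x 160000 distinct framework/CDR3 combinations.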

B. Selection 

We screened our libraries by phage display for binding to one of two targets, a neutral synthetic
polymer, poly-vinylpyrrolidone (PVP), and a short DNA loop of 9 nucleotides (Materials & Methods).
Two previous studies established the capacity of antibody phage display to select binders for
these targets [2, 14]. Phage display is a standard high-throughput screening technique [15]. It is
based on the fusion of each antibody sequence to the sequence of the pIII surface protein of the
filamentous bacteriophage M13, a natural virus of the bacterium E. coli with the shape of a 1 µm
long and 10 nm wide cylinder [15]. The engineered phage encapsulates the DNA sequence of an
antibody and displays the corresponding polypeptide at its surface. Populations of up to 10^^
phages displaying a total diversity of up to 10^^ different antibodies can thus be manipulated. A
round of selection consists in retrieving the phages bound either to the bottom of a plate where
the PVP target is attached or to magnetic beads where the DNA target is coated. It is followed by
a round of amplification achieved by infecting bacteria with the selected phages. We performed
experiments where each sequence is initially present in ^ 10"^ copies and where targets are
provided in excess. Starting either from a single library (single framework) or from a mixture of
different libraries, three rounds of selection/amplification were performed. Although the
enrichment of some of the sequences is intended to reflect binding to the specified targets, other
factors may contribute, such as sequence-specific differences in amplification. In our
experiments, such non-target-specific selective factors can be detected but are not dominant
(Supporting Information). Our analysis and its interpretation, however, do not rely on the precise
nature of the selective pressure.
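
To make the effect of repeated selection and amplification concrete, here is a minimal sketch of an idealized, noise-free round: each sequence passes selection with a probability proportional to its selectivity, and amplification simply rescales the population. The deterministic update rule, the function name `select_round` and the numerical selectivities are assumptions made for illustration, not a description of the experimental protocol.

```python
def select_round(freqs, selectivities):
    """One idealized selection/amplification round.

    Each frequency is weighted by the selectivity of its sequence and the
    result is renormalized (amplification restores the population size).
    """
    weighted = [f * s for f, s in zip(freqs, selectivities)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical selectivities for four sequences, starting from equal frequencies.
selectivities = [1.0, 0.5, 0.1, 0.01]
freqs = [0.25, 0.25, 0.25, 0.25]
history = [freqs]
for t in range(3):  # three rounds, as in the experiments
    freqs = select_round(freqs, selectivities)
    history.append(freqs)

for t, f in enumerate(history):
    print(t, [round(x, 4) for x in f])
```

In this idealization the ratio of frequencies of a sequence between two successive rounds is proportional to its selectivity, with a proportionality constant common to all sequences.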


C. High-throughput sequencing 

We sequenced samples of 10^ — 10^ sequences at different rounds of selection by Illumina MiSeq
paired-end high-throughput sequencing [16]. The results give us an estimation of the relative
frequencies $f_i^t$ of each sequence $i$ in the population at each round $t = 0, 1, 2, 3$. In
estimating these frequencies, we take into account both sequencing and sampling errors
(Materials & Methods).

[Figure 1 schematic: library-specific framework ($V_H$ segment; 24 variants) x diversity
(CDR3; $20^4$ variants) x fixed framework (1 variant).]

FIG. 1: Library design - We designed a total of 24 libraries with distinct frameworks and
identical sequence diversity consisting of all $20^4 = 1.6 \times 10^5$ combinations of the 20
natural amino acids at 4 consecutive positions. The design follows the natural design of the
variable (V) region of the heavy (H) chain of antibodies, which is assembled by joining three gene
segments, the variable ($V_H$), diversity ($D_H$) and joining ($J_H$) segments. The
library-specific parts of the frameworks (in blue) are from natural $V_H$ and diversity is
introduced at the third complementarity determining region (CDR3, in red), at the junction between
$V_H$ and $D_H J_H$, a part of the sequence critical for specific binding to antigens; the $D_H$
and $J_H$ segments (in black) are common to all libraries.

We define the selectivity to a target of each sequence $i$ as its probability $s_i$ to pass a
round of selection. This selectivity is inferred up to a multiplicative factor $a$ as the ratio of
the frequencies of the sequence after and before selection [17]:

$$ s_i = a \, \frac{f_i^{t}}{f_i^{t-1}}. \tag{1} $$

The unknown multiplicative constant $a$ reflects our lack of quantitative control over the rate of
amplification of the sequences. In our calculations, we arbitrarily fix $a$ so that
$\sum_i s_i = 1$; we explain below how our conclusions depend on this choice. We compare the
frequencies between rounds $t = 3$ and $t - 1 = 2$, where sequences with highest selectivities are
best represented.
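
The inference of selectivities from read counts can be sketched as follows. The count threshold `n0` and the relative sampling error (sum of the inverse square roots of the two counts) follow the Materials & Methods; the dictionary-based data layout, the function name `infer_selectivities`, the toy counts and the normalization of the retained selectivities to a unit sum are assumptions made for this example.

```python
import math


def infer_selectivities(counts_prev, counts_next, n0=10):
    """Estimate selectivities from read counts at rounds t-1 and t.

    counts_prev, counts_next: dict mapping sequence -> number of reads.
    Returns a dict mapping sequence -> (selectivity, relative sampling error),
    keeping only sequences with more than n0 reads at both rounds.
    """
    total_prev = sum(counts_prev.values())
    total_next = sum(counts_next.values())
    raw = {}
    for seq, n_prev in counts_prev.items():
        n_next = counts_next.get(seq, 0)
        if n_prev > n0 and n_next > n0:
            # Ratio of frequencies after and before selection.
            raw[seq] = (n_next / total_next) / (n_prev / total_prev)
    norm = sum(raw.values())  # assumed choice for the arbitrary factor a
    result = {}
    for seq, ratio in raw.items():
        rel_err = 1 / math.sqrt(counts_prev[seq]) + 1 / math.sqrt(counts_next[seq])
        result[seq] = (ratio / norm, rel_err)
    return result


# Toy counts (hypothetical): "DDDD" is discarded because it has too few reads at t-1.
counts_round2 = {"AAAA": 100, "CCCC": 100, "DDDD": 5}
counts_round3 = {"AAAA": 400, "CCCC": 100, "DDDD": 50}
for seq, (s, err) in infer_selectivities(counts_round2, counts_round3).items():
    print(seq, round(s, 3), round(err, 3))
```

Because the factor a is common to all sequences, only ratios of selectivities within one experiment are meaningful in this sketch.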

Previous studies have applied next-generation sequencing to the outcome of phage display screens
as a way to identify a large number of binders [18, 19] but have not investigated the distribution
of the relative selectivities of these binders.


D. Reproducibility and specificity 

Several observations based on the frequencies and amino-acid patterns of the sequences in
populations under selection validate our experimental approach. (1) Screening the same library
against the same target in separate experiments yields reproducible frequencies $f_i^3$ at the
last round $t = 3$ (Figure S3). (2) Screening the same library against different targets yields
target-specific amino-acid patterns (Figure S4). (3) Screening two libraries against the same
target yields library-specific amino-acid patterns (Figure S4). Taken together, these results show
that enrichment of some of the sequences is reproducible and arises from selection for specific
binding to the targets. We note that one feature of our experiments is critical for
reproducibility: the initial populations maximize degeneracy (the number of copies of each
sequence) rather than diversity (the number of distinct sequences).

II. RESULTS


A. Hierarchy between libraries 

To compare the selective potentials of libraries built around different frameworks, we performed
experiments in which the initial population of sequences consists of a mixture of libraries with
distinct frameworks - a meta-library. The results of these experiments reveal a striking
hierarchy. Diverse members of the same library, i.e., sequences sharing a common framework,
typically dominate. When repeating the experiment with an initial mixture of libraries that
excludes the dominating library, another library dominates (Figure 2). Libraries not selected when
mixed with other libraries nevertheless do contain sequences with detectable selectivities, as
shown by screening them in isolation (Figure S4). These results are not explained by uneven
representations of the libraries in the initial population or by framework-specific differences
during amplification (Figure S5).

Differences in frameworks are thus generally more significant than differences between variable
parts, even though these parts are clearly under selection for binding (Figures 3B and 3D). This
result may not be surprising for very dissimilar frameworks, but our frameworks are all expected
to share the same structural fold and some frameworks have few sequence differences. In
particular, the dominating framework when selecting the mixture of all 24 libraries against the
DNA target (Figure 2) is a germline human $V_H$ framework, which dominates two libraries built on
frameworks derived from it by affinity maturation, which share respectively 65% and 85% of their
amino acids with it. The observed hierarchy is target dependent: different frameworks dominate
when screening the meta-library against different targets. Remarkably, when screening the 24
libraries against the PVP target (Figure S6), the dominating framework is the only other germline
framework of the mixture (the SI framework). As noted previously, differences between frameworks
also appear in the patterns of amino acids that are selected at the level of CDR3s
(Figures S4C-E).


B. Scaling within libraries 

To compare the selectivities of sequences sharing a common framework, and therefore differing by
at most 4 amino acids (Figure 1), we rank these sequences in decreasing order of their selectivity
$s_i$ and plot these selectivities versus the ranks on a double logarithmic scale - a
representation of the cumulative distribution of selectivities within a library. For several
experiments, this representation reveals a power law: if $s(r)$ is the selectivity of the sequence
of rank $r$, then, for the sequences with top ranks,

$$ s(r) \sim r^{-\kappa}. \tag{2} $$


[Figure 2: bar charts of the frequencies of the libraries at Round 1 (left) and Round 2 (right).]

FIG. 2: Hierarchy between libraries - Frequencies of the different libraries in two successive
rounds of selection against the DNA target. Black bars report selection of all 24 libraries and
white bars selection of a subset of 21 libraries, excluding the 3 libraries above the red dotted
line. At the second round (right), the population is enriched in sequences from one particular
library, the HG library, in contrast to what is observed at the first round (left). The subset of
21 libraries excludes the library dominating the mixture of all 24 libraries, which leads another
library, the GHl library, to dominate. Within the two libraries, a diversity of CDR3 are selected
(Figures 3B and 3D), with different patterns of amino acids (Figure S4). Enrichment from the other
libraries can also be observed when they are screened in isolation (Supporting Information).


Figure 3A shows an example where the exponent is $\kappa \simeq 0.5$. While this power law is
observed for several libraries (different frameworks) and selective pressures (different targets),
it is not systematic: deviations are often observed for the very top sequences (Figure 3B) and,
for several experiments, a power law cannot be justified (Figure 3D).
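
The rank representation described above can be sketched in a few lines: sort the selectivities in decreasing order, pair them with their ranks, and estimate the exponent from the slope on a double logarithmic scale. The least-squares slope used here is a crude illustration chosen for brevity (it is not the maximum-likelihood analysis used for the data), and the synthetic input is an exact power law with exponent 0.5, not experimental data.

```python
import math


def rank_plot(selectivities):
    """Return (rank, selectivity) pairs, rank 1 being the largest selectivity."""
    ordered = sorted(selectivities, reverse=True)
    return [(r, s) for r, s in enumerate(ordered, start=1)]


def loglog_slope(points):
    """Least-squares slope of log(selectivity) versus log(rank)."""
    xs = [math.log(r) for r, _ in points]
    ys = [math.log(s) for _, s in points]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic selectivities following s(r) = r**(-0.5) exactly.
synthetic = [r ** -0.5 for r in range(1, 1001)]
kappa_hat = -loglog_slope(rank_plot(synthetic))
print(round(kappa_hat, 3))  # 0.5
```

On real data the slope depends on which ranks are included in the fit, which is one reason a threshold-based likelihood analysis is preferable.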

Both the power law and its various deviations can, however, be rationalized under an elementary
mathematical model. This model rests on two assumptions. First, it assumes that the selectivity of
each sequence in a library is drawn independently at random from a common probability density
$p(s)$, which may depend on the framework and the target. Second, it assumes that the sequences
with top selectivities are in the tail of this probability density.

FIG. 3: Scaling relations within libraries - The selectivities $s_i$ of the sequences are
represented versus their ranks for four experiments differing by the input library and the choice
of the target against which it is selected. A. SI library against the PVP target. B. HG library
against the DNA target. C. F3 library against the PVP target. D. GHl library against the DNA
target. In A, the distribution of the top ~1000 sequences follows a power law with exponent
$\kappa \simeq 0.5$. This behavior is consistent with the prediction of extreme value theory (EVT)
when the shape parameter is positive, $\kappa > 0$ (see Figure 4 for the analysis that justifies
this conclusion). Although not obvious from this representation, the data in B is also consistent
with EVT when $\kappa > 0$ while the data in C and D are consistent with EVT when, respectively,
$\kappa = 0$ and $\kappa < 0$.

The model is thus probabilistic even though, barring experimental noise, the experiments have no
inherent stochastic element. To the extent that selectivity reflects binding at thermodynamic
equilibrium, the selectivity $s_i$ of antibody $i$ is indeed determined by its binding free energy
$\Delta G_i$ to the target: $s_i \propto e^{-\Delta G_i / k_B T}$, where $T$ represents the
temperature and $k_B$ the Boltzmann constant. The binding free energy $\Delta G_i$ is a physical
quantity which, in principle, is fully determined by the sequence of amino acids. In the spirit of
applications of random matrix theory to nuclear physics [20], it may nevertheless be advantageous
to discard this microscopic description in favor of a coarser probabilistic description, which
treats the selectivities $s_i$ as instances of random variables independently drawn from a common
probability density $p(s)$. In contrast to nuclear physics, no symmetry constrains $p(s)$ a priori
but, if concerned only with the largest selectivities, results from extreme value theory (EVT),
the branch of probability theory dealing with extrema of random variables [21], do constrain the
form of the tail of $p(s)$ from which they originate, thus allowing for non-trivial predictions.
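
To give a sense of scale for the relation between selectivity and binding free energy mentioned above, the following sketch assumes a Boltzmann form, with selectivity proportional to exp(-ΔG / k_B T), and free energies expressed per mole (so that the gas constant replaces k_B). The temperature of 298.15 K, the 1 kcal/mol difference and the function name are illustrative assumptions.

```python
import math

# Molar gas constant in kcal / (mol K): 8.314462618 J/(mol K) divided by 4184 J/kcal.
R_KCAL = 8.314462618 / 4184.0


def selectivity_ratio(delta_delta_g, temperature=298.15):
    """Ratio of selectivities of two binders whose binding free energies
    differ by delta_delta_g (kcal/mol; negative means tighter binding),
    assuming selectivity is proportional to exp(-dG / RT)."""
    return math.exp(-delta_delta_g / (R_KCAL * temperature))


# A binder that is 1 kcal/mol more stable is favored by a factor of about 5.4.
print(round(selectivity_ratio(-1.0), 2))
```

Under this assumption, modest differences in free energy translate into multiplicative differences in selectivity, which is why selectivities are naturally compared on a logarithmic scale.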

EVT indeed indicates that random variables $s$ independently drawn from the tail of a common
probability density have themselves a probability density of the form [3]

$$ f_{\kappa,\tau,s^*}(s) = \frac{1}{\tau} \, f_\kappa\!\left(\frac{s - s^*}{\tau}\right), \tag{3} $$

with $f_\kappa$ necessarily belonging to the generalized Pareto family:

$$ f_\kappa(x) = \begin{cases} (1 + \kappa x)^{-(1 + 1/\kappa)} & \text{if } \kappa \neq 0, \\ e^{-x} & \text{if } \kappa = 0, \end{cases} \tag{4} $$

where the exponential for $\kappa = 0$ is just the continuous limit of $f_\kappa(x)$ when
$\kappa \to 0$. Here, $s^*$ represents a threshold above which the tail of $p(s)$ is defined,
$\tau$ is a scaling factor (which absorbs the undetermined factor $a$ introduced in Eq. (1)) and
$\kappa > -1$ is the so-called shape factor (independent of $a$), which defines the universality
class to which the distribution of selectivities belongs: the probability densities $p(s)$ may
differ, but if they are associated with the same $\kappa$, events drawn from their tails will
share similar statistical properties.

As suggested by the notations, when $\kappa > 0$, but only when $\kappa > 0$, this model predicts
that the top ranked sequences follow a power law with exponent $\kappa$, as described by Eq. (2).
Mathematically, when considering a large number $N$ of samples, the rank $r(s)$ is indeed related
to the cumulative distribution of selectivities by

$$ r(s) \simeq N \int_s^\infty p(x)\,dx. \tag{5} $$

If $p(s) \sim s^{-(1 + 1/\kappa)}$ for large $s$ as predicted by Eq. (4) for $\kappa > 0$, we must
then have $\int_s^\infty p(x)\,dx \sim s^{-1/\kappa}$ and therefore $r(s) \sim s^{-1/\kappa}$,
which is equivalent to Eq. (2). In other words, the power law seen in Figure 3A corresponds to the
expected relationship between the rank and the values of random variables drawn from the tail of a
probability density when this density belongs to a class associated with $\kappa > 0$.
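
The argument above can be checked numerically. The sketch below evaluates the survival function of a generalized Pareto tail and the local exponent -d ln s / d ln r that it implies for the rank plot; the finite-difference step `eps` and the parameter values are arbitrary choices for illustration. The local exponent approaches the shape parameter only deep in the tail, which is consistent with the power law being an asymptotic statement.

```python
import math


def gpd_survival(s, kappa, tau, s_star):
    """Fraction of the tail above s for a generalized Pareto density."""
    x = (s - s_star) / tau
    if kappa == 0:
        return math.exp(-x)
    return (1.0 + kappa * x) ** (-1.0 / kappa)


def local_exponent(s, kappa, tau, s_star, eps=1e-4):
    """Finite-difference estimate of -d ln(s) / d ln(r), where the rank r is
    proportional to the survival function (the prefactor N cancels out)."""
    s_hi = s * (1.0 + eps)
    dln_r = (math.log(gpd_survival(s_hi, kappa, tau, s_star))
             - math.log(gpd_survival(s, kappa, tau, s_star)))
    dln_s = math.log(1.0 + eps)
    return -dln_s / dln_r


# With kappa = 0.5, tau = 1 and s* = 0, the local exponent decreases towards 0.5.
for s in (1.0, 10.0, 100.0, 1e4):
    print(s, round(local_exponent(s, 0.5, 1.0, 0.0), 4))
```

For these parameters the exact local exponent is (1 + 0.5 s) / s, so the approach to 0.5 is slow at moderate selectivities.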

To precisely assess the ability of our model to describe all these different cases, we followed
the point-over-threshold approach, a standard method in applications of EVT to empirical data [3].
This approach consists in fitting the data $s_i$ satisfying $s_i > s^*$ by a function of the form
$f_{\kappa,\tau,s^*}(s)$ for different values of the threshold $s^*$, and then in estimating
whether a threshold $s^*_{\min}$ exists such that for $s^* > s^*_{\min}$ the inferred parameter
$\hat\kappa(s^*)$ is nearly independent of $s^*$. To apply this method, we inferred the parameters
$\hat\kappa(s^*)$ and $\hat\tau(s^*)$ by maximum likelihood from the data $s_i > s^*$ for every
value of $s^*$. For the data presented in Figure 3A, an illustration is provided in Figure 4A,
with error-bars indicating 95% confidence intervals (see Supporting Information for the analysis
of other experiments). In this example, we observe that $\hat\kappa(s^*)$ becomes nearly constant,
of the order of 0.5, for $s^* > s^*_{\min} \simeq$ 4 X 10“^. The determination of $s^*_{\min}$ is
performed by visual inspection but any choice of $s^* > s^*_{\min}$ should give equivalent
results.
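
The threshold scan described above can be sketched with the standard library only. The fit below maximizes the generalized Pareto likelihood through the usual reparametrization theta = kappa / tau, for which the optimal kappa at fixed theta is the mean of log(1 + theta y); it is restricted to kappa > 0 for brevity, so it is not a complete implementation of the analysis (which also covers kappa <= 0 and confidence intervals). The function names, the grid bounds and the synthetic data (exact quantiles of a generalized Pareto density with kappa = 0.5 and tau = 1) are assumptions made for this example.

```python
import math


def gpd_profile_fit(exceedances):
    """Maximum-likelihood (kappa, tau) for exceedances y = s - s_star > 0,
    restricted to kappa > 0, using the profile likelihood in theta = kappa/tau."""
    n = len(exceedances)
    scale = sum(exceedances) / n

    def kappa_of(theta):
        return sum(math.log1p(theta * y) for y in exceedances) / n

    def profile_loglik(theta):
        k = kappa_of(theta)
        return n * (math.log(theta) - math.log(k) - k - 1.0)

    # Coarse log-spaced grid in theta, then golden-section refinement in log(theta).
    grid = [10 ** (e / 20.0) / scale for e in range(-80, 81)]
    best = max(range(len(grid)), key=lambda i: profile_loglik(grid[i]))
    lo = math.log(grid[max(best - 1, 0)])
    hi = math.log(grid[min(best + 1, len(grid) - 1)])
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(60):
        a = hi - g * (hi - lo)
        b = lo + g * (hi - lo)
        if profile_loglik(math.exp(a)) < profile_loglik(math.exp(b)):
            lo = a
        else:
            hi = b
    theta = math.exp((lo + hi) / 2.0)
    kappa = kappa_of(theta)
    return kappa, kappa / theta


def gpd_quantiles(n, kappa, tau):
    """Deterministic stand-in for a sample: n evenly spaced quantiles."""
    return [tau * (((i + 0.5) / n) ** -kappa - 1.0) / kappa for i in range(n)]


data = gpd_quantiles(2000, 0.5, 1.0)
for s_star in (0.0, 1.0, 2.0):
    tail = [s - s_star for s in data if s > s_star]
    kappa_hat, tau_hat = gpd_profile_fit(tail)
    print(s_star, len(tail), round(kappa_hat, 2), round(tau_hat, 2))
```

For data that are generalized Pareto above every threshold, the fitted shape parameter should stay nearly constant while the fitted scale grows linearly with the threshold, which is the stability criterion used to choose the threshold.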

FIG. 4: Extreme value analysis by the point-over-threshold approach - A. Values of the inferred
parameter $\hat\kappa(s^*)$ from selectivities $s_i > s^*$ as a function of the threshold $s^*$.
The inference is made by maximum likelihood and the error-bars indicate 95% confidence intervals.
Inset: similarly for $\hat\tau(s^*)$, the second parameter of the model, which is estimated
jointly with $\hat\kappa(s^*)$. For $s^*$ sufficiently large, $s^* > s^*_{\min}$,
$\hat\kappa(s^*)$ should be constant and $\hat\tau(s^*)$ increase linearly with slope
$\hat\kappa(s^*)$. This is observed here for $s^*_{\min} \simeq$ 4 x 10“^ (red dotted line) and
leads to $\kappa$ = 0.45 ± 0.22 and $\tau$ = 1.6 x 10“^ ± 10“^; $\kappa = 0$ can be excluded by
likelihood ratio test with a p-value < 10“^. B. Quantile-quantile plot representing the data $s_i$
against predictions from the model based on the inferred value of $\kappa$ only. A straight line
is expected for a good fit with a slope and a y-intercept given by the two other parameters $\tau$
and $s^*$. Inset: probability-probability plot comparing the empirical cumulative distribution
from the data to the cumulative distribution from the inferred model, showing an excellent
agreement. The data for this figure comes from the selection of the SI library against the PVP
target as in Figure 3A (see Figures S8-S10 for a similar analysis of the data shown in
Figures 3B-D).

Given $s^* > s^*_{\min}$ and the associated values of $\kappa = \hat\kappa(s^*)$ and
$\tau = \hat\tau(s^*)$ inferred from maximum likelihood, the next step is to estimate whether this
best fit is indeed a good fit. The diagnosis is commonly performed visually using
probability-probability (P-P) and quantile-quantile (Q-Q) plots [3]. The P-P plot compares the
empirical and modeled cumulative distributions by representing the quantile function
$q(s) = r(s)/N$ (the fraction of the data above $s$) against the cumulative
$Q(s) = \int_s^\infty f_{\kappa,\tau,s^*}(x)\,dx$. As indicated by Eq. (5), a straight line
$y = x$ is expected if the fit is perfect, which the inset of Figure 4B shows to be nearly the
case in this example. The Q-Q plot makes a similar comparison but by representing $s$ against
$Q(q^{-1}(s))$, where $q^{-1}(x)$ represents the value of $s$ above which a fraction $x$ of the
data is located. This representation has two advantages over the P-P plot: it relies only on the
estimation of $\kappa$ and it displays more clearly the contribution of the most extreme values. A
straight line is expected if the fit is perfect, but this time with a slope $\tau$ and a
y-intercept $s^*$. The main panel of Figure 4B indicates again a very good fit in the illustrated
case.
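
The two diagnostic plots can be sketched as follows. For each data point in the tail, the P-P coordinates pair the model survival probability with the empirical fraction of data above that point, and the Q-Q coordinates pair the quantile of a standard generalized Pareto density (scale 1, threshold 0, so that only the shape parameter is used) with the data value. The plotting position (rank - 0.5) / n, the function names and the synthetic tail are assumptions made for this example.

```python
import math


def gpd_survival(s, kappa, tau, s_star):
    """Model fraction of the tail above s."""
    x = (s - s_star) / tau
    if kappa == 0:
        return math.exp(-x)
    return (1.0 + kappa * x) ** (-1.0 / kappa)


def standard_gpd_quantile(q, kappa):
    """Value above which a fraction q of a standard GPD (tau=1, s*=0) lies."""
    if kappa == 0:
        return -math.log(q)
    return (q ** -kappa - 1.0) / kappa


def pp_qq_points(tail, kappa, tau, s_star):
    """Return the P-P points (model, empirical) and Q-Q points (model, data)."""
    ordered = sorted(tail, reverse=True)
    n = len(ordered)
    pp, qq = [], []
    for rank, s in enumerate(ordered, start=1):
        q_emp = (rank - 0.5) / n  # empirical fraction of the tail above s
        pp.append((gpd_survival(s, kappa, tau, s_star), q_emp))
        qq.append((standard_gpd_quantile(q_emp, kappa), s))
    return pp, qq


# Synthetic tail lying exactly on the model (kappa=0.5, tau=2, s*=1).
kappa, tau, s_star = 0.5, 2.0, 1.0
tail = [s_star + tau * standard_gpd_quantile((r - 0.5) / 100, kappa)
        for r in range(1, 101)]
pp, qq = pp_qq_points(tail, kappa, tau, s_star)
print(pp[0], qq[0])
```

For a perfect fit the P-P points fall on the diagonal and the Q-Q points fall on a line of slope tau and intercept s*, as in this synthetic case.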

Performing the same analysis on results of selections of various libraries against various
targets, we find that the model is able to describe all the experiments (Figures S8-S12).
Different values of $\kappa$ are obtained with differences that are statistically significant
(Table S1). In particular, the three cases $\kappa > 0$, $\kappa = 0$ and $\kappa < 0$ are
represented.

While many models can lead to a power law [23], our probabilistic model has the merit of
explaining the various deviations from this behavior that the data exhibits. First, when
$\kappa > 0$, EVT predicts a power law with exponent $\kappa$ for the top sequences but accounts
for deviations both for the very top-ranked sequences, which, under the model, may vary widely
(Figure S7), and for sequences of smaller selectivities, where $f_\kappa$ in Eq. (4) can provide
an excellent fit well beyond the point where the power law applies (e.g., Figures 4B and S8).
Second, EVT predicts behaviors differing from a power law if the probability density $p(s)$
belongs to a universality class associated with $\kappa \leq 0$, consistent with the results of
some of the experiments (e.g., Figures 3D and S10).


III. DISCUSSION 

We presented a quantitative analysis of in vitro selections of multiple libraries of partially
randomized proteins with variations limited to four consecutive amino acids. The distribution of
selectivities of the top sequences is described by few parameters, with an interpretation provided
by an elementary probabilistic model based on extreme value theory (EVT).

Within a library whose members share a common framework, this distribution is characterized by a
shape parameter $\kappa$, which may be either positive, negative or zero. This parameter is
independent of the unknown factor $a$ in Eq. (1) and has several interpretations. For instance, it
controls the relative spacing between selectivities: ranking the sequences from best to worst, the
expected difference of selectivity between sequences at rank $r$ and $r + 1$,
$\Delta_r = E[s_r - s_{r+1}]$, satisfies $\Delta_r / \Delta_1 \sim r^{-(1+\kappa)}$, i.e., the
larger $\kappa$, the wider the spread between phenotypes in the library (Supporting Information).
The shape parameter also provides a statistical answer to the following question: if sampling $N$
sequences yields a top sequence of selectivity $s_1$, what best selectivity $s'_1$ may we expect
from sampling $N' > N$ sequences? The difference $E[s'_1 - s_1]$ is a sharply increasing function
of $\kappa$ (Supporting Information; Figure S13); as a consequence, multiplying by a factor 1000
the number of sequences when $\kappa = 0$ is expected to have the same effect as multiplying it by
a factor 2 when $\kappa = 0.2$, if starting with N = 10^ sequences.
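
The dependence of the expected gain on the shape parameter can be illustrated with a simple heuristic that is not the exact expectation derived in the Supporting Information: the "characteristic largest value" among n draws from a generalized Pareto tail, defined as the value exceeded with probability 1/n. The function names, the unit scale and the choice of 1000 tail sequences are assumptions made for this example.

```python
import math


def characteristic_max(n_tail, kappa, tau=1.0, s_star=0.0):
    """Value exceeded with probability 1/n_tail under a generalized Pareto tail."""
    if kappa == 0:
        return s_star + tau * math.log(n_tail)
    return s_star + tau * (n_tail ** kappa - 1.0) / kappa


def gain(n_tail, factor, kappa, tau=1.0):
    """Increase of the characteristic largest value when the number of tail
    sequences is multiplied by `factor`."""
    return (characteristic_max(n_tail * factor, kappa, tau)
            - characteristic_max(n_tail, kappa, tau))


# Doubling a tail of 1000 sequences, for increasing shape parameters.
for kappa in (0.0, 0.2, 0.5):
    print(kappa, round(gain(1000, 2, kappa), 3))
```

In this heuristic the gain grows only logarithmically with the sample size when the shape parameter is zero, but as a power of the sample size when it is positive, which is the qualitative point made in the text.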

Besides the shape parameter $\kappa$, the other parameters are the scaling parameter $\tau$, the
threshold of selectivities $s^*$ that defines where the tail starts, and the fraction $\phi$ of
the data above this threshold (there is some freedom in the choice of $s^*$ on which both $\tau$
and $\phi$ depend: see Supporting Information). Within our experimental set-up where the
selectivities are determined only up to a multiplicative factor (see Eq. (1)), the values of
$s^*$, $\phi$ and $\tau$ obtained from different experiments cannot be directly compared, but our
selections with mixtures of libraries suggest that $s^*$ varies from library to library on a scale
larger than the scale of the differences of selectivity within libraries. All the parameters of
the model are found to be both framework and target dependent (Table S1).

Based on these results, we propose these parameters as general descriptors of the selective
potential of a population of random variants facing a given selective constraint. In particular,
these descriptors could be applied to re-visit the fundamental problem of estimating the density
of functional proteins or RNAs in sequence space. Previous studies have estimated this density by
counting the number of different sequences enriched in in vitro selections [24, 25]. The results
of such experiments depend on experimental noise, which sets a lower limit $s_{\rm noise}$ on
detectable selectivities. In turn, our approach is dependent only on the library content and the
selective pressure, provided $s^* > s_{\rm noise}$.

Power laws are seemingly ubiquitous in distributions of protein features [26, 27]. Most closely
related to our work, the distribution of abundances of distinct antibody sequences in zebrafish
has been shown to follow a power law with exponent $\alpha \simeq 1$ [28, 29]. Only instantaneous
frequencies, not selectivities, are accessible in such a case, but, assuming a homogeneous initial
distribution of sequences, frequencies and selectivities have the same distribution and
$\alpha = \kappa$ if $\kappa > 0$. However, repeating $n$ times the same selection leads to
$\alpha = n\kappa$, which does not account for a stable exponent $\alpha > 0$, which may arise in
natural repertoires from fluctuating selective pressures [30]. One possible extension of our
approach could be to explore this scenario by changing the target between successive rounds of
selection.

While many models can be consistent with a power law, our model based on EVT covers without
additional assumption the deviations from a power law observed in the data; in particular, it can
fit the data over a larger range of selectivities and account for non-power-law behaviors. This
is, however, not the first application of EVT to the description of biological variation:
Gillespie first introduced it in models of evolutionary dynamics as a way to constrain the
distribution of beneficial effects obtained when mutating a wild-type individual [31, 32]. He
assumed $\kappa = 0$, arguing that this class includes all “well-behaved” distributions, including
the exponential, normal, log-normal and gamma distributions. Mathematical models for the
distribution of affinities in combinatorial molecular libraries have also proposed that it should
have universal features but only considered distributions in the exponential class
$\kappa = 0$ [33, 34].

Several experimental studies have recently investigated the value of $\kappa$ applicable to the
distribution of beneficial effects in viral or bacterial populations [35, 36]. The sample sizes
available in these studies are, however, insufficient to conclusively validate or invalidate the
EVT hypothesis. In these experiments, the number of mutants found in the tail has indeed been so
far very low, of the order of a dozen: estimating the sign of the shape parameter $\kappa$ can be
attempted [37], but assessing the validity of the fit using quantile-quantile plots as in Figure 4
is simply impossible with such limited data. Our rich dataset thus provides the first thorough
test of the applicability of EVT to the analysis of biological diversity.

Comparable datasets are now being increasingly produced. In particular, several groups have
characterized the phenotype of every single-point mutant of a protein [38]. Our model may be
viewed as a mathematical formalization of the concept of a random library, from which single-point
mutants may deviate. We note, however, that selectivities from non-random subsets of one of our
libraries do follow the same model as the full library (Figure S14). In any case, significant
deviations will have to be quantified against our null model.

Beyond protein libraries, the model is relevant to the screening of synthetic chemical libraries,
including the combinatorial libraries of small molecules developed in the pharmaceutical industry
for drug discovery [39, 40]. In this context, one previous study was performed with enough data
points to possibly discriminate between different universality classes [41], although the authors
considered only the exponential case $\kappa = 0$.

Finally, our work raises a new question for future studies: if the selective potential of a
partially randomized library is captured by few parameters, and if these parameters can vary from
library to library, what controls them? More simply, what features of the framework define a
universality class? For instance, how does extending the variable parts to other sites change
$\kappa$? The patterns of amino acids forming the sequences, which we have analyzed here only to
confirm the reproducibility of the experimental results and their specificity with respect to the
targets and libraries, may provide valuable insights [29].

The question may also be asked at another level: can we or natural evolution control these
parameters to optimize the selective potential of a population? This question relates to the
debated “evolution of evolvability” [42, 43], cast here into a concrete conceptual and
experimental setting. Antibodies potentially define an excellent model system to study this
question experimentally, since they are subject to selection and maturation towards a diversity of
targets as part of their natural function. The approach and concepts introduced in this work
provide the means to address the problem with quantitative experiments.


Materials & Methods

Phage Display - PVP plates were prepared as described in [2]. The DNA target was prepared by
self-assembly of a hairpin DNA, labeled with biotin at its 5’ end
(5’-Biotin-AAAAGACCCCATAGCGGTCTGCGT), and was purchased from Eurogentec (Angers, France). E. coli
TG1 competent cells were purchased from Lucigen Lt. Phage production, phage-display screens based
on the pIT2 phagemid vector and helper phage K07 production were performed following the standard
protocol from Source BioScience (Cambridge, U.K.;
http://lifesciences.sourcebioscience.com/media/143421/tomlinsonij.pdf) and our own previous
work [2, 44] with some modifications specified in the Supporting Information.

Sequencing data — Library phagemids were purified from 
E. coli stocks after each selection round using Midiprep kits 
from Macherey-Nagel (Hoerdt, France). Illumina MiSeq v3
sequencing was performed by Eurofins Genomics (Ebersberg,
Germany), using paired-end reads.
Frameworks were recovered on the forward read and only
the reads having all the expected restriction sites and fewer
than 4 errors on the 126 bases were kept. The CDR3 were
accessible on the reverse read and only the reads having all
the expected restriction sites and an average read quality
Q > 30 on the 12 bases defining the CDR3 were kept (see
Table S2 for an estimation of sequencing errors).

Computational analysis — We infer the selectivity $s_i$
of an amino acid sequence $i$ by Eq. (1) with $t = 3$ (third
round of selection). The frequencies are simply given by
$f_i^t = n_i^t / \sum_j n_j^t$, where $n_i^t$ is the number of sequences $i$
present in the sample at round $t$. Given sampling errors, estimated as
$\Delta s_i/s_i = 1/\sqrt{n_i^{t-1}} + 1/\sqrt{n_i^t}$, and given sequencing errors,
estimated at ~5% over the 12 bases of the CDR3 (Table
S2), the estimation of $s_i$ is meaningful only for sequences
that are sufficiently present at each round: $n_i^{t-1} > n_0$ and
$n_i^t > n_0$. We took $n_0 = 10$ and verified that the results are
not sensitive to this exact value (Table S3). With $n_0 = 10$,
relative sampling errors are, in the worst case, as high as
$2/\sqrt{n_0} \simeq 60\%$, but, assuming that sampling errors are
uncorrelated, this uncertainty has no major impact on
the estimation of aggregated properties of the distribution
of the largest $s_i$, which involves several hundreds of different $i$.
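
The filtering and error estimate described above can be illustrated with a minimal Python sketch. It assumes that Eq. (1) amounts, up to a normalization constant, to the ratio of frequencies between consecutive rounds, $s_i \propto f_i^t / f_i^{t-1}$; this reading, the function name `estimate_selectivities`, the dictionary-based input and the toy counts are all hypothetical and only serve the illustration.

```python
import math

def estimate_selectivities(counts_prev, counts_curr, n0=10):
    """Return {sequence: (selectivity, relative_sampling_error)}.

    Assumption (not taken from the text): the selectivity is the ratio of
    frequencies f_i^t / f_i^(t-1), defined up to a constant factor.
    """
    total_prev = sum(counts_prev.values())
    total_curr = sum(counts_curr.values())
    result = {}
    for seq, n_curr in counts_curr.items():
        n_prev = counts_prev.get(seq, 0)
        # Keep only sequences sufficiently sampled at both rounds (n > n0).
        if n_prev <= n0 or n_curr <= n0:
            continue
        f_prev = n_prev / total_prev
        f_curr = n_curr / total_curr
        # Relative sampling error: 1/sqrt(n^(t-1)) + 1/sqrt(n^t).
        rel_err = 1.0 / math.sqrt(n_prev) + 1.0 / math.sqrt(n_curr)
        result[seq] = (f_curr / f_prev, rel_err)
    return result

# Toy counts (hypothetical) for rounds 2 and 3.
round2 = {"GWYT": 100, "AAAA": 400, "RRRR": 5}
round3 = {"GWYT": 400, "AAAA": 100, "RRRR": 50}
sel = estimate_selectivities(round2, round3)
for seq, (s, err) in sorted(sel.items(), key=lambda kv: -kv[1][0]):
    print(seq, round(s, 3), round(err, 3))
```

In this toy example the sequence counted only 5 times at round 2 is discarded, however enriched it appears at round 3, because its sampling error would dominate the estimate.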

Extreme value statistics — We followed the standard approach
for modeling threshold excesses [3]. The parameters
$\kappa$ and $\tau$ were estimated by maximum likelihood and the 95%
confidence intervals shown in Figure 4A were obtained under
the hypothesis of normality by calculating the inverse of
Fisher’s information. To ensure that the data allow us to discriminate
between $\kappa = 0$ and $\kappa \neq 0$, a p-value was calculated
by a likelihood ratio test, whose distribution was estimated
by numerical simulations. Maximum likelihood estimations
are calculated on at least 50 data points.
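
As a rough illustration of this procedure, the sketch below fits a generalized Pareto distribution to threshold excesses by maximum likelihood, using only the Python standard library. It is not the authors' implementation. The profile-likelihood grid search, the function names and the simulated data are assumptions. The p-value uses the asymptotic chi-square approximation with one degree of freedom, which is swapped in here for the numerical simulations mentioned in the text.

```python
import math
import random

def gpd_profile_loglik(theta, excesses):
    """Profile log-likelihood in theta = kappa/tau; returns (loglik, kappa, tau)."""
    n = len(excesses)
    kappa = sum(math.log1p(theta * y) for y in excesses) / n
    tau = kappa / theta
    return -n * (math.log(tau) + kappa + 1.0), kappa, tau

def fit_gpd(excesses, n_grid=200, n_zoom=4):
    """Maximum-likelihood (kappa, tau, loglik) by a zooming grid search on theta."""
    y_max = max(excesses)
    y_mean = sum(excesses) / len(excesses)
    lower = -0.999 / y_max              # kappa < 0 requires theta > -1/y_max
    lo, hi = lower, 20.0 / y_mean       # upper bound is an arbitrary choice
    best = None
    for _ in range(n_zoom):
        step = (hi - lo) / n_grid
        for k in range(n_grid + 1):
            theta = lo + k * step
            if abs(theta) < 1e-12:      # theta = 0 is the exponential limit
                continue
            ll, kappa, tau = gpd_profile_loglik(theta, excesses)
            if best is None or ll > best[0]:
                best = (ll, kappa, tau, theta)
        lo, hi = max(best[3] - step, lower), best[3] + step
    return best[1], best[2], best[0]

def exponential_loglik(excesses):
    """Maximized log-likelihood of the exponential model (kappa = 0)."""
    n = len(excesses)
    return -n * (math.log(sum(excesses) / n) + 1.0)

def lrt_pvalue(excesses):
    """p-value for kappa = 0 against kappa != 0 (chi-square, 1 dof: an assumption)."""
    _, _, ll_gpd = fit_gpd(excesses)
    d = max(0.0, 2.0 * (ll_gpd - exponential_loglik(excesses)))
    return math.erfc(math.sqrt(d / 2.0))

def sample_gpd(kappa, tau, size, rng):
    """Inverse-transform sampling of generalized Pareto excesses (kappa != 0)."""
    return [tau * ((1.0 - rng.random()) ** (-kappa) - 1.0) / kappa
            for _ in range(size)]

rng = random.Random(0)
data = sample_gpd(kappa=0.4, tau=1.0, size=2000, rng=rng)  # simulated excesses
kappa_hat, tau_hat, _ = fit_gpd(data)
print(kappa_hat, tau_hat, lrt_pvalue(data))
```

With real data, `excesses` would be the values $s_i - s^*$ for the selectivities above the chosen threshold $s^*$. A dedicated statistics library would be a sounder choice than this grid search for production use.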
SUPPORTING INFORMATION


Supplementary experimental methods 


Library construction 

The library-specific parts of the frameworks, upstream of the variable CDR3 (Figure 1), are shown in Figure S21.
23 of these frameworks were designed based on the amino acid sequences of 23 natural VH segments, with minor
modifications to accommodate common restriction sites at the two ends of the CDR2 and CDR3. Out of these 23
frameworks, 20 were chosen to have minimal sequence similarities, and 3 derive from the same human VH segment: one
is the germline (naive) form, one results from limited maturation (85% sequence similarity to the germline) and the
other from extensive maturation (broadly neutralizing antibody against HIV [1] with only 65% sequence similarity
to the germline). A 24th framework was made exclusively of glycines to serve as a control. Downstream of the
CDR3, the fixed part has amino acid sequence FDYWGQGTLVTVSSG in all libraries. The nucleotide sequences
were optimized for E. coli codon usage and are provided as a supplementary file.

The 24 frameworks were obtained from Genewiz (South Plainfield, NJ) as synthetic genes with restriction sites
flanking the CDRs to allow for the introduction of arbitrary sequences at the CDRs. In particular, the CDR3 region is
flanked by BssHI and XhoI sites. These synthetic genes were cloned into a modified version of the pIT2 phagemid (standard
phage display vector) lacking VL [2]. To randomize the CDR3 region, a degenerate oligonucleotide containing 12
random nucleotides (from Eurogentec, Angers, France) flanked by BssHI and XhoI sites was PCR-amplified, digested,
and ligated into gel-purified pIT2 phagemids harboring each of the 24 frameworks. Ligation products were purified
and electroporated into TG1 E. coli (from Lucigen, Middleton, WI) at efficiencies exceeding 10^? transformants (to
ensure a >100-fold coverage of the 10^? diversity), while keeping 100-fold lower efficiency in control electroporations
of ligation product without insert (to minimize the occurrence of empty vectors in libraries, below 1%).


Phage display screening 

All chemicals were purchased from Sigma-Aldrich (St Louis, MO) unless otherwise specified. Deionized water of
resistivity 16 MΩ.cm was produced with an ion exchange resin (Aquadem(R) system, Veolia, Lyon, France). 2xTY
medium was prepared by dissolving 16 g tryptone, 10 g yeast extract, 5 g NaCl (tryptone and yeast extract from
USBIO distributed by Euromedex, Strasbourg, France) in 1 L of deionized H2O and autoclaving for 15 min at 120 °C.

The DNA target (PAGE purified, lyophilized) was resuspended in deionized water at 400 µM. 20 mg of magnetic
beads coated with streptavidin (Dynabeads(R) M-280 Streptavidin from Invitrogen Life Technologies SAS, Saint
Aubin, France) were prepared according to the manufacturer’s protocol. 10 µL of DNA stock solution were mixed
with 20 mg of washed Dynabeads(R) and incubated for 10 minutes at room temperature using gentle rotation. The
biotinylated hairpin DNA coated beads were separated with a strong magnet for 2-3 minutes and washed 2-3 times
with a buffer containing 5 mM Tris-HCl (pH 7.5), 0.5 mM EDTA and 1 M NaCl.

Phage production is the same as described before except that the infected TG1 culture was grown for 7 hours
(instead of overnight) at 30 °C in 2xTY + 100 µg/mL ampicillin + 50 µg/mL kanamycin.

During phage display experiments, the supernatant containing our library (around 10^? phages) in 2xTY was
adjusted to 10 mM NaPO4, pH 7.4. Phages were first incubated against either naked magnetic beads or a non-treated
polystyrene 3 cm diameter Nunc Petri dish (Thermo Fisher Scientific, Waltham, MA) for negative selection. For DNA
target selection, DNA LoBind tubes (Eppendorf AG, Hamburg, Germany) were used. Phages were incubated for
1 h without agitation and 30 min on a rocker at room temperature. The remaining phages were then incubated
with either hairpin DNA or PVP targets. In the case of hairpin DNA, 50 µL of beads were incubated with an
excess of DNA targets (around 10^?), washed according to the manufacturer’s protocol, yielding on the order of 10^?
immobilized DNA targets, at a 100-fold excess over available phages (10^?). Antibody selection was then performed
against either DNA-coated beads or a PVP-functionalized Petri dish for 90 min on a rocker. 10 washing steps with
1x PBS + 0.1% Tween 20 were performed. Next, selected phages were eluted using 1 mL of a fresh solution of 100 mM
triethylamine for 20 min and neutralized with 500 µL of Tris-HCl buffer (1 M, pH 7.4). Eluted phages were rescued
by infection of an excess of exponentially growing TG1 E. coli cells (14 mL of a 2xTY culture at O.D. 600 nm = 0.6)
for titration and phage preparation for subsequent rounds of selection. Infected TG1 were then plated on 2xTY +
ampicillin plates for overnight amplification at 37 °C. Glycerol stocks were stored at -80 °C.


Amplification biases 

Each round of selection is followed by a round of amplification, which consists of infecting the bacteria with the selected
phages. Sequence-specific differences in amplification may arise from differences in the growth rate of the bacteria
carrying different phagemids, or from differences in infectivity or display ability of the phages. We measured by sequencing
both the differences between frameworks when considering a mixture of the 24 libraries (Figure S9) and between
CDR3 sequences when considering a library of given framework (Figure S19).

Between frameworks, only the S1 and CH1 libraries show significant enrichment upon amplification alone. Each of
these two libraries dominates over the others in one experiment of selection with a mixture of libraries but, when the
mixture of all 24 libraries is selected against the DNA target, they are dominated by another library, the HG library,
which does not show any enrichment upon amplification. This observation, together with the strong correlation
between frequencies before and after amplification (Figure S9B), indicates that differences in library amplification
are not responsible for the observed hierarchy between frameworks (Figure 2).

Within each library, a clear enrichment for the glutamine amino acid is observed, irrespective of the framework
(Figure S19). This bias has a simple interpretation: the strain of E. coli that we use for phage display is a partial
amber stop codon suppressor. In this strain, the amber codon codes for a glutamine about one third of the time and
acts as a stop codon the other two thirds. The reduced production of antibodies due to the presence of an amber codon
thus confers a growth advantage to the bacteria (antibody expression is costly for E. coli). Consistent with this
interpretation, we verified that all the glutamines present in the data are associated with the amber codon. The results
presented in the paper exclude sequences with an amber codon but, in most experiments with selection, glutamine
does not appear in the selected consensus sequence and treating the amber codon as coding for an amino acid or for
a stop codon has no bearing on the conclusions. Apart from glutamine amplification, no other significant pattern
of amplification is visible or may plausibly explain the results of the experiments with selection.
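
Excluding sequences that carry an amber codon can be sketched as follows. The sketch assumes that the 12 randomized nucleotides are read in frame from their first base and that reads are given as plain nucleotide strings; the function name and the example reads are hypothetical.

```python
def has_amber_codon(cdr3_nt):
    """True if the in-frame codons of a CDR3 nucleotide sequence include TAG."""
    seq = cdr3_nt.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return "TAG" in codons

reads = [
    "GGTTGGTATACT",   # GGT TGG TAT ACT: no amber codon
    "GGTTAGTATACT",   # GGT TAG TAT ACT: in-frame amber codon
    "GTAGGGTATACT",   # GTA GGG TAT ACT: TAG occurs only out of frame
]
kept = [r for r in reads if not has_amber_codon(r)]
print(kept)
```

Checking codons rather than searching for the substring `TAG` matters: the third read contains the three letters but no amber codon in the reading frame, so it is kept.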


Supplementary properties of extreme value distributions 


Relations between parameters 

The fit to an extreme value distribution with parameters $(\kappa_0, \tau_0)$ applies for selectivities above a threshold $s_0$.
Fitting the data above a larger threshold $s_1 > s_0$ must lead to the same shape parameter $\kappa_1 = \kappa_0$ (simply denoted $\kappa$)
but to a different scaling parameter $\tau_1$ given by [3]

$$\tau_1 = \tau_0 + \kappa\,(s_1 - s_0). \tag{S6}$$

These are the relations verified in Figure 4A.

Another independent parameter, which depends on the bulk of the distribution, is the fraction $\phi$ of the data
above the threshold $s_0$, which obviously depends on the value of the threshold ($\phi_1 < \phi_0$ when $s_1 > s_0$).

In total, four parameters are thus relevant: $s^*$, $\phi$, $\kappa$ and $\tau$.
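
The threshold relation of Eq. (S6) can be checked numerically. The sketch below assumes the standard generalized Pareto survival function for the excesses, $(1 + \kappa y/\tau)^{-1/\kappa}$; the function names and parameter values are hypothetical.

```python
import math

def gpd_survival(excess, kappa, tau):
    """P(S - threshold > excess) for a generalized Pareto tail (kappa != 0)."""
    return (1.0 + kappa * excess / tau) ** (-1.0 / kappa)

def rescaled_tau(tau0, kappa, s0, s1):
    """Scale parameter above a threshold s1 >= s0, following Eq. (S6)."""
    return tau0 + kappa * (s1 - s0)

kappa, tau0, s0, s1 = 0.4, 1.0, 0.0, 2.0
tau1 = rescaled_tau(tau0, kappa, s0, s1)
s = 5.0
# Survival above s, conditional on exceeding s1, computed from the fit at s0 ...
conditional = gpd_survival(s - s0, kappa, tau0) / gpd_survival(s1 - s0, kappa, tau0)
# ... equals the survival of a tail with the same kappa and the rescaled tau1.
direct = gpd_survival(s - s1, kappa, tau1)
print(conditional, direct)
```

The two numbers agree because conditioning a generalized Pareto tail on exceeding a higher threshold leaves the shape unchanged and shifts the scale linearly.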


Spacings between extremes 

We show here that if $s_1 > s_2 > \dots$ are drawn at random with a probability density given by Eq. (3), then
their spacings, defined by $\Delta_r = s_r - s_{r+1}$, scale as $\Delta_r/\Delta_1 \simeq r^{-(\kappa+1)}$, where $\kappa$ is the shape parameter. This follows
from a more general result:

$$\Delta_r \simeq \frac{\tau N^{\kappa}}{r^{\kappa+1}}, \tag{S7}$$

where $\tau$ is the scaling parameter and $N$ the total number of samples.

The proof can be given in terms of the rescaled variable $x = (s - s^*)/\tau$, whose probability density $f_\kappa(x)$ is defined
in Eq. (4), since $s_r - s_{r+1} = \tau\,(x_r - x_{r+1})$. As indicated by Eq. (5), the rank $r$ and the value $x$ are related for large $N$ by
$r(x)/N \simeq \int_x^\infty f_\kappa(u)\,du = (1 + \kappa x)^{-1/\kappa}$. Inverting this relation gives $x_r = q(r/N)$ with $q(z) = (z^{-\kappa} - 1)/\kappa$. In the
limit of large $N$ where the formalism applies, we have therefore $x_r - x_{r+1} \simeq -(1/N)\,q'(r/N)$, with the derivative of
$q(z)$ given by $q'(z) = -z^{-(\kappa+1)}$. All together, this gives $x_r - x_{r+1} \simeq N^{\kappa} r^{-(\kappa+1)}$ and thus $\Delta_r \simeq \tau N^{\kappa} r^{-(\kappa+1)}$.
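
To see how accurate the derivative step of this proof is, the sketch below compares the spacing obtained by differencing $q(r/N)$ directly with the approximation of Eq. (S7). The function names and the values of $N$ and $\kappa$ are hypothetical.

```python
def quantile_from_rank(z, kappa):
    """Rescaled value x at rank fraction z = r/N: q(z) = (z**-kappa - 1)/kappa."""
    return (z ** (-kappa) - 1.0) / kappa

def quantile_spacing(r, n_samples, kappa, tau=1.0):
    """Spacing obtained by differencing q at ranks r and r + 1."""
    return tau * (quantile_from_rank(r / n_samples, kappa)
                  - quantile_from_rank((r + 1) / n_samples, kappa))

def approx_spacing(r, n_samples, kappa, tau=1.0):
    """Eq. (S7): tau * N**kappa / r**(kappa + 1)."""
    return tau * n_samples ** kappa / r ** (kappa + 1.0)

n_samples, kappa = 10**6, 0.5
ratios = {r: quantile_spacing(r, n_samples, kappa) / approx_spacing(r, n_samples, kappa)
          for r in (1, 10, 100, 1000)}
for r, ratio in ratios.items():
    print(r, round(ratio, 4))
```

The ratio approaches 1 as the rank grows. Replacing the finite difference by a derivative is therefore least accurate for the very top ranks.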


Scaling of maxima with library sizes 

We show here that if the maximum of $N$ random variables drawn with a probability density given by Eq. (3) is $s_1$,
then adding more elements to produce a library $m > 1$ times larger leads to a maximal value $s'_1 > s_1$ satisfying

$$E[s'_1 - s_1] = \tau N^{\kappa} m^{\kappa}\, \mathrm{Li}_{\kappa+1}\!\left(1 - \frac{1}{m}\right), \tag{S8}$$

where $\mathrm{Li}_{\kappa}(z) = \sum_{n=1}^{\infty} z^n/n^{\kappa}$ defines the so-called polylogarithm function. $E[s'_1 - s_1]$ is thus an increasing function
of $\kappa$, as illustrated in Figure S17, where Eq. (S8) is also compared to numerical simulations. The relevance of this
formula rests on the assumption that sub-libraries of a library with shape parameter $\kappa$ are characterized by the
same $\kappa$, which finds support in the data (Figure S18). In practice, the extreme value distribution applies only to
the fraction $\phi N$ of the data above the threshold $s^*$. When expressed relative to the expected spacing between the
top two values in the initial population of $N$ variables, $\Delta_1 = \tau N^{\kappa}$, Eq. (S8) is however independent of $N$ and $\tau$:

$$E[s'_1 - s_1]/\Delta_1 = m^{\kappa}\, \mathrm{Li}_{\kappa+1}\!\left(1 - \frac{1}{m}\right).$$

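To derive this formula, we consider an initial population of size $mN$ whose maximum is $s'_1$ and define a subpopulation
of (approximate) size $N$ by retaining with probability $1/m$ each of its elements. The maximum $s_1$ of this subpopulation
has thus rank $n$ in the initial population with probability $p_n = (1 - 1/m)^{n-1}/m$, the probability that none of the $n - 1$
top values are retained but that the $n$-th is. Following Eq. (S7), the distance between $s_1 = s'_n$ and $s'_1$ is estimated as
$\delta'_n = s'_1 - s'_n = \sum_{r=1}^{n-1} \Delta'_r = \tau (mN)^{\kappa} \sum_{r=1}^{n-1} r^{-(\kappa+1)}$. This leads to

$$E[s'_1 - s_1] = \sum_{n=1}^{\infty} p_n \delta'_n
= \tau (mN)^{\kappa} \sum_{n=1}^{\infty} \sum_{r=1}^{n-1} \frac{1}{m}\left(1 - \frac{1}{m}\right)^{n-1} \frac{1}{r^{\kappa+1}}
= \tau (mN)^{\kappa} \sum_{r=1}^{\infty} \sum_{n=r+1}^{\infty} \frac{1}{m}\left(1 - \frac{1}{m}\right)^{n-1} \frac{1}{r^{\kappa+1}} \tag{S9}$$

and, after summing the geometric series $\sum_{n=r+1}^{\infty} (1 - 1/m)^{n-1} = (1 - 1/m)^{r}\, m$, to

$$E[s'_1 - s_1] = \tau (mN)^{\kappa} \sum_{n=1}^{\infty} \left(1 - \frac{1}{m}\right)^{n} \frac{1}{n^{\kappa+1}}, \tag{S10}$$

which is equivalent to Eq. (S8).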
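
The relative gain $m^{\kappa}\,\mathrm{Li}_{\kappa+1}(1 - 1/m)$ can be evaluated by summing the polylogarithm series directly, as in the following sketch. The function names and the truncation tolerance are hypothetical choices, and the direct summation becomes slow when $m$ is very large because the series then converges slowly.

```python
import math

def polylog(order, z, tol=1e-15):
    """Li_order(z) = sum over n >= 1 of z**n / n**order, for 0 <= z < 1."""
    total, n, power = 0.0, 1, z
    while power > tol:                 # the remaining terms are below tol/(1 - z)
        total += power / n ** order
        n += 1
        power *= z
    return total

def relative_gain(kappa, m):
    """E[s'_1 - s_1] / Delta_1 = m**kappa * Li_{kappa+1}(1 - 1/m), from Eq. (S8)."""
    return m ** kappa * polylog(kappa + 1.0, 1.0 - 1.0 / m)

for kappa in (-0.5, 0.0, 0.5):
    print(kappa, [round(relative_gain(kappa, m), 3) for m in (2, 10, 100)])
```

For $\kappa = 0$ the expression reduces to $\mathrm{Li}_1(1 - 1/m) = \ln m$, the logarithmic growth expected for the maximum of exponentially distributed variables. This provides a simple sanity check of the formula.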
             $\kappa$        $\tau$                      $s^*$          $\phi$
S1/PVP       0.44 ± 0.22     1.6 × 10^? ± ?              0.37 × 10^?    ~ 6 × 10^?
F3/PVP       0.07 ± 0.21     3.1 × 10^? ± 8 × 10^?       1.2 × 10^?     ~ 6 × 10^?
HG/DNA       0.26 ± 0.21     5.7 × 10^? ± 1.5 × 10^?     0.7 × 10^?     ~ 6 × 10^?
CH1/DNA      -0.62 ± 0.25    2.5 × 10^? ± 8 × 10^?       2.5 × 10^?     ~ 3.6 × 10^?

TABLE S1: Parameters $\kappa$, $\tau$, $s^*$ and $\phi$ describing the four experiments presented in Figure 3. Note that $\tau$ and $\phi$ depend on
$s^*$, which may be chosen within a finite interval of values. However, the values of $\tau(s^*)$ and $\phi(s^*)$ at $s^* = s_0$ determine their
values at $s^* = s_1$, as indicated in Eq. (S6) for $\tau$.



Sample                          Fraction of sequences with
                                ≥1 sequencing error / 12 bases
Mix24                           0.043
Mix24 amplified                 0.029
Mix24/PVP round 1               0.025
Mix24/PVP round 2               0.051
Mix24/PVP round 3               0.107
Mix24/DNA round 1               0.032
Mix24/DNA round 2               0.065
Mix24/DNA round 3               0.029
Duplicate Mix24/DNA round 3     0.024
Mix21/PVP round 2               4 × 10^?
Mix21/PVP round 3               4 × 10^?
Mix21/DNA round 1               0.027
Mix21/DNA round 2               0.046
Mix21/DNA round 3               0.106
F3/PVP round 1                  0.034
F3/PVP round 2                  0.029
F3/PVP round 3                  0.04
F3/DNA round 1                  0.027
F3/DNA round 2                  0.048
F3/DNA round 3                  0.085

TABLE S2: Estimation of sequencing errors - Fraction of the sequences with at least one error in the 12 bases immediately
downstream of the 12 bases of the CDR3 (errors estimated given the known sequence of the fixed part of the framework).


Experiment                  Thresholds                         $\kappa$
S1/PVP in Mix24             $n_0^2 = n_0^3 = 10$               0.34 ± 0.22
S1/PVP in Mix24             $n_0^2 = 10$ and $n_0^3 = 25$      0.42 ± 0.25
S1/PVP in Mix21             $n_0^2 = n_0^3 = 100$              0.56 ± 0.18
S1/PVP in Mix21 sampled     $n_0^2 = n_0^3 = 10$               0.48 ± 0.16
S1/PVP in Mix21 sampled     $n_0^2 = 10$ and $n_0^3 = 50$      0.67 ± 0.37

TABLE S3: Robustness of the EVT analysis - The analysis presented in the main text retains only sequences present in
sufficient number in the samples of the populations that are sequenced at the second and third rounds, namely $n_i^2 > n_0^2 = 10$
and $n_i^3 > n_0^3 = 10$. This table shows that varying the values of the thresholds $n_0^2$ and $n_0^3$ has little impact on the value of the
shape parameter $\kappa$ inferred by EVT analysis. The sample of the S1 library against PVP in the mixture of 21 libraries (Mix21,
see Figure 2) contained 10^? sequences while the sample in the mixture of 24 libraries (Mix24) contained only 10^?; the last two
rows of the table show that further sampling the former at 1/10 to reach samples of comparable sizes has no impact on
the results.
Supplementary figures
FIG. S5: Diversity of the libraries - The different libraries are intended to harbor the same distribution of amino acids at the 4
varied positions. We measured these distributions by sequencing samples from the initial libraries. A. Sequence logos showing
the entropies of the various amino acids at the four positions: the distribution is non-uniform but similar from one library to
the next. B. More quantitatively, the distance between distributions is estimated using the Jensen-Shannon divergence: if $q_i^a$ is
the frequency of amino acid $a$ in the CDR3 of library $i$, the Jensen-Shannon divergence between libraries $k$ and $l$ is defined as
$\frac{1}{2} \sum_a \left[ q_k^a \ln(q_k^a/\bar{q}^a) + q_l^a \ln(q_l^a/\bar{q}^a) \right]$ with $\bar{q}^a = (q_k^a + q_l^a)/2$. This divergence is found to be 5 to 10 times larger than expected from sampling noise. This
represents the experimental precision at which we were able to introduce the same diversity in each library. These differences
of frequencies between initial libraries are, however, much smaller than the differences of frequencies before and after a round
of selection within the same library.
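
A Jensen-Shannon divergence between two amino acid frequency distributions can be computed as in the sketch below. It uses the standard symmetric definition with natural logarithms and a factor 1/2, which is assumed here to match the convention of the caption; the function name and the dictionary representation are hypothetical.

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log) between two frequency dictionaries."""
    total = 0.0
    for a in set(p) | set(q):
        pa, qa = p.get(a, 0.0), q.get(a, 0.0)
        mean = 0.5 * (pa + qa)
        # Terms with zero frequency contribute nothing (0 * ln 0 = 0 by convention).
        if pa > 0:
            total += 0.5 * pa * math.log(pa / mean)
        if qa > 0:
            total += 0.5 * qa * math.log(qa / mean)
    return total

lib_k = {"G": 0.5, "A": 0.3, "W": 0.2}
lib_l = {"G": 0.4, "A": 0.4, "W": 0.2}
print(jensen_shannon(lib_k, lib_l))
```

The divergence vanishes for identical distributions and reaches its maximum, $\ln 2$, for distributions with no amino acid in common.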



FIG. S6: Sequence similarities between frameworks - Similarity between two frameworks is measured as the fraction of common 
amino acids in an alignment of their two sequences. Only the library-specific part of the frameworks (Figure 1) defined in 
Figure S21 is considered here. In most cases, the sequence similarity is in the range 30-60%. 
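
A minimal sketch of such a similarity measure is given below. It assumes that the two sequences are already aligned to the same length, that `-` marks a gap, and that the fraction is taken over all alignment columns; the caption does not specify how gaps are counted, so this convention and the function name are hypothetical.

```python
def alignment_similarity(aligned_a, aligned_b, gap="-"):
    """Fraction of alignment columns carrying the same amino acid in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for x, y in zip(aligned_a, aligned_b)
                    if x == y and x != gap)    # gap-gap columns do not count
    return identical / len(aligned_a)

print(alignment_similarity("AC-D", "ACED"))
```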
FIG. S7: Reproducibility - To assay the reproducibility of the experiments, two independent selections of the mixture of 24
libraries were performed against the DNA target and the frequencies of the sequences were compared at the third round: the 
high correlation between the two results indicates high reproducibility. 
FIG. S8: Library and target specificities - Relative entropies of the different amino acids at the third round of different
experiments; the relative entropy is calculated per site as $f_i^a \ln(f_i^a/q_i^a)$, where $f_i^a$ is the frequency of amino acid $a$ at position $i$ in
the third round and $q_i^a$ in the initial library (round 0). A and B show that the consensus sequence is framework dependent. B
and C show that it is target specific. Finally, C, D and E provide further evidence of framework dependency. (White squares
indicate amino acids not represented in the population.)
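
The per-site quantity $f_i^a \ln(f_i^a/q_i^a)$ can be computed from two lists of CDR3 amino acid sequences as sketched below. The function names and the toy sequences are hypothetical, and sequences are assumed to have equal length.

```python
import math
from collections import Counter

def site_frequencies(sequences, position):
    """Frequencies of the amino acids found at one position."""
    counts = Counter(seq[position] for seq in sequences)
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}

def relative_entropy_terms(selected, initial, position):
    """Return {amino acid: f * ln(f / q)} at one position.

    Amino acids absent from the selected population are omitted, as are
    amino acids absent from the initial sample (q = 0 leaves the term undefined).
    """
    f = site_frequencies(selected, position)
    q = site_frequencies(initial, position)
    return {aa: fa * math.log(fa / q[aa]) for aa, fa in f.items() if aa in q}

initial = ["GA", "AA", "GG", "AG"]      # toy round-0 sample
selected = ["GA", "GA", "GG", "AA"]     # toy round-3 sample
terms = relative_entropy_terms(selected, initial, 0)
print(terms)
```

Enriched amino acids give positive terms and depleted ones negative terms; summing the terms over amino acids gives the relative entropy of the site.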
FIG. S9: Biases in amplifications - Comparison of the composition of the mixture of 24 libraries before and after amplification
in absence of selection. A. Differences of frequencies, showing that only the S1 and CH1 libraries are enriched. B. Correlations
between the frequencies (same data).
FIG. S10: Target-dependent hierarchy - Figure 2 shows that a mixture of 24 libraries selected against the DNA target is
dominated by the HG framework while a mixture of 21 libraries that excludes the HG library is dominated by the CH1
framework. As shown in this figure, when the same mixture of 24 libraries is selected against the PVP target, a different
framework, the S1 framework, dominates (consistently, it also dominates when screening the mixture of 21 libraries, which
includes S1).
FIG. S11: Variations in the values of extreme selectivities - When sampling $N$ random variables from the extreme probability
density $f_\kappa(x)$ given by Eq. (4), the value $s_r$ of the variable of rank $r$ is distributed with a mean $\bar{s}_r$ and standard deviation
$\delta s_r$. The ratio $\delta s_r/\bar{s}_r$ is largest for the very top sequences, as shown here based on numerical simulations. This observation is
consistent with deviations of the data from a power law observed for the very top selectivities even when $\kappa > 0$ and when the
overall fit with an extreme value distribution is good (Figures 4A-B).
FIG. S12: EVT analysis for the selection of the HG library against the DNA target (data shown in Figure 3B). A fit of
the general model gives $\kappa$ = 0.26 ± 0.21, $\tau$ = 5.7 × 10^? ± 1.5 × 10^?, while a fit of the exponential model ($\kappa = 0$) gives
$\tau_0$ = 8 × 10^? ± 1.4 × 10^?; the exponential model is excluded with a p-value of 1.4 × 10^?, in favor of $\kappa > 0$. Note that the
threshold $s^*$ = 10^? above which the fit is stable and good is much below the value of the selectivity above which a power law
is observed in Figure 3B, of the order of $s$ = 10^?.
FIG. S13: EVT analysis for the selection of the F3 library against the PVP target (data shown in Figure 3D). A fit of
the general model gives $\kappa$ = 0.07 ± 0.21, $\tau$ = 3.1 × 10^? ± 8 × 10^?, while a fit of the exponential model ($\kappa = 0$) gives
$\tau_0$ = 3.4 × 10^? ± 6 × 10^?; the likelihood ratio test against the exponential model gives a p-value of 0.75, which is not significant. These data are
therefore consistent with an exponential model, $\kappa = 0$.
FIG. S14: EVT analysis for the selection of the CH1 library against the DNA target (data shown in Figure 3C). A fit of
the general model gives $\kappa$ = -0.62 ± 0.25, $\tau$ = 2.5 × 10^? ± 8 × 10^?, while a fit of the exponential model ($\kappa = 0$) gives
$\tau_0$ = 1.5 × 10^? ± 4 × 10^?; the exponential model is excluded with a p-value of 10^?, in favor of $\kappa < 0$.
FIG. S15: EVT analysis for the selection of the F3 library against the DNA target (data not shown in the main text). A
fit of the general model gives $\kappa$ = 0.97 ± 0.38, $\tau$ = 2 × 10^? ± 8 × 10^?, while a fit of the exponential model ($\kappa = 0$) gives
$\tau_0$ = 7.5 × 10^? ± 10^?; the exponential model is excluded with a p-value < 10^?, in favor of $\kappa > 0$.
FIG. S16: EVT analysis for the selection of the N1 library in the mixture of 24 libraries against the PVP target (data not
shown in the main text). A fit of the general model gives $\kappa$ = 0.38 ± 0.21, $\tau$ = 1.3 × 10^? ± 3 × 10^?, while a fit of the exponential
model ($\kappa = 0$) gives $\tau_0$ = 2.2 × 10^? ± 3 × 10^?; the exponential model is excluded with a p-value < 10^?, in favor of $\kappa > 0$.
FIG. S17: Scaling of the best binder with the library size - To estimate the gain that sampling the same library $m$ times more
may provide as a function of the shape parameter $\kappa$, we show here the expected difference $E[s'_1 - s_1]$ between the maximum $s'_1$
of $mN$ samples drawn with probability density $f_\kappa(x)$ from Eq. (4) and the maximum $s_1$ of $N$ sub-samples. The plain lines are
based on Eq. (S8) with $\tau = 1$ and $N$ = 10^? and the dots are the results of numerical simulations (averaged over many draws),
showing a good agreement between the two. Note how $E[s'_1 - s_1]$ depends more strongly on $\kappa$ than on $m$.
FIG. S18: Stability of the shape parameter $\kappa$ for non-random sub-samples of the same library - To test whether non-random
sub-libraries may be expected to be described by the same shape parameter as the library from which they originate, we consider
here the results of the selection of library S1 against PVP (Figure 3A), for which the consensus CDR3 has amino acid sequence
GWYT, and we make four non-overlapping sub-libraries consisting of sequences with CDR3 at distance $d$ = 1 to 4 from this
consensus, where the distance just counts the number of amino acid differences (number of mutations). This figure shows the
selectivity versus the rank of the sequences in these sub-libraries. An EVT analysis indicates that $\kappa(d = 1)$ = 0.33 ± 0.39,
$\kappa(d = 2)$ = 0.40 ± 0.26, $\kappa(d = 3)$ = 0.30 ± 0.23, $\kappa(d = 4)$ = 0.53 ± 0.22: all these values are comparable to the value
$\kappa$ = 0.44 ± 0.22 of the shape parameter for the full library.
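
The construction of these sub-libraries can be sketched as follows: sequences are grouped by their number of amino acid differences from the consensus, and each group is ranked by selectivity. The function names and the toy selectivities are hypothetical; only the consensus GWYT comes from the caption.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def split_by_distance(selectivities, consensus="GWYT"):
    """Group selectivities by distance d >= 1 from the consensus, ranked within each group."""
    groups = defaultdict(list)
    for seq, s in selectivities.items():
        d = hamming(seq, consensus)
        if d > 0:                      # the consensus itself belongs to no sub-library
            groups[d].append(s)
    return {d: sorted(values, reverse=True) for d, values in groups.items()}

toy = {"GWYT": 1.0, "AWYT": 0.5, "GWYA": 0.7, "AAYT": 0.2, "AAAA": 0.01}
sub_libraries = split_by_distance(toy)
print(sub_libraries)
```

Each ranked list could then be passed to a separate extreme value fit to obtain one shape parameter per distance.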
FIG. S19: Amplification bias - Relative entropy between frequencies of CDR3 sequences before and after amplification without
selection, showing an enrichment in glutamine (represented by the letter Q). The results presented in the paper exclude
sequences with an amber codon, which is responsible for this effect (see supplementary experimental methods), but, in most
experiments with selection, glutamine does not appear in the selected consensus sequence and treating the amber codon as
coding for an amino acid or for a stop codon has no bearing on the conclusions.
FIG. S20: Reproducibility of selections against the PVP target - The results of two experiments of selection against the PVP
target, one starting from a mixture of 24 libraries and the other from a subset of 21 libraries, each of which is dominated by
the S1 library, not only lead to an identical consensus sequence (panel A) but also to reproducible results by EVT analysis (Table
S3). In this case, not only are the initial populations different, but also potentially the targets, since the experiments were
performed 1.5 years apart and PVP is subject to aging: this may explain the imperfect correlations between frequencies (panel
B; by contrast, the selection of the F3 library against PVP was performed at the same time as the selection of the mixture
of 24 libraries and differences in consensus sequences cannot be due to differences of the targets in this case).
PAMAATELIQPDSWIKPGETLTITCRVSGASITDSSSHYGTAWIRQPAGKGLEWFN - 

PAMAAVELTQVTSVMLKPGDSLTLSCKVSGYSVTDNS—YATAWIRQPAGKGLEWIN - 

PAMAGEELTQPASMTVQPSQSLSINCKVS-YSVTS - YYTAWIRQPAGKGLEWIG - 

PAMAEIRLDQSSAWKRPGESVKISCKINGLDMTA - HYMHWIRQKPGKGLEWVG - 

PAMASQTLIESDSVIIKPDQSHKLTCTASGFNFGG--SWMA—WIRQSPGKGLEWVA - 

PAMAGQSLTSLGSWKRPGESVTLSCTLSGFSLDS - YWMSWIRQKPGKGLEWIG - 

PAMAQVQLRESGPSLVKPSQTLSLTCTTSGFSLTSYGVTW - FRQAPGKGLEWLG - 

PAMAAVTLDESGGGLQTPGGTLSLVCKGSGFTFND--YAMG—WMRQAPGKGLEWVA - 

PAMADVTLTESGGDVKRPGESLKLSCKASGFDFSS--YWMG—WVRQPPGKGLEFVS - 

PAMAEVTVSLSVPELVKPSEKLKLVCKVSGALITDGSKIHAVNYIRQFSGSGLEFLA - 

PAMAQITLDQPGSAAVKPSETVKLSCKVS - VSVTSYAWA— WIWQAPGKGLEYIG - 

PAMAQISLMESGPGTVKPTTTLQLTCKVTGASLTDSTNMYGVLWVRQPAGKGLEWLG - 

PAMASQTLQESGPGTVKPSESLRLTCTVSGFELTS - NAVTWIRQPPGKGLEWIG - 

PAMADVQLDQSESWIKLGGSHKLSCTASGFTFSD — YWMS — WIRQAPGKGLERVF - 

PAMAQVTLRESGPALVKPTQTLTLTCTFSGFSLSTSGMCVS — WIRQPPGKGLEWLA - 

PAMAQVQLQQSGPGLVKPSQTLSLTCAISGDSVSSNSAAWN — WIRQSPSRGLEWLG - 

PAMAEVQLVQSGAEVKKPGESLRISCKGSGYSFTS - YWISWVRQMPGKGLEWMG - 

PAMADIVLTQPKTEAATPGGSITLTCKVSGFALSS — YAMH — LVRQAPGQGLEWLL - 

PAMAQS -LEESRGGL IKPGGTLTLTCTASGFT I S S — YYMC — WVRQAPGKGLEWI G - 

PAMAEVTLIQPEAENGHPGGSMRLTCKTSGFDLDS — YAMS — WVRQVPGQGLEWIV - 

PAMAQLQLQESGPGLVKPSETLSLTCTVSGGSISSSSYYWG — WIRQPPGKGLEWIG - 

PAMAQLQLQESGPGLVKPSETLSLTCIVSGGSIGTTDHYWG — WIRQSPGKGLEWIG - 

PAMAQPQLQESGPTLVEASETLSLTCAVSGDSTAACNSFWG — WVRQPPGKGLEWVGSLS 


— SIYYDGG-INKKDSLKDKFVISRDTSSSTVILTGQDMQTEDTAVYYCAR 
— YIWGGGS-SYHKDSLKSKFSISKDGSSSTVTLRGQNLQTEDTAVYYCAR 
— YISNNGG-TVYSDKLKNKFSISRDTATNTITIRGQNLQTEDTAVYYCAR 
— RMDAGKNQAIYAESLKNQFTLTEDVPASTQCLEVKSLRTEDTAVYYCAR 
— T I SDT SGSK YYS S ALKGRFT I SRDNS KMEVYLHMASVRTEDTAVYYCAR 
— RIDSGTG-TTFTQSLKGQFSITKDTNKNMLYLEVKSLKTEDMAVYYCAR 

- EINNNGFMDRNPDLKSRLNITREISLSQVSLSLSRVTPEDTAVYYCAR 

— GIRNDGSYPIYGAALKGRATISRDNGQSTVRLQLNNLRAEDTGTYYCAR 
— ILEYDSDRRYFGQSLKGRFTTSRENSNSMLYLQMNSLRVEDTAMYYCAR 

- HINYAAGTALNPDLKSRLTLSRDTAKNEAYLEISGMTAGDTAMYYCAR 

- YLGSDGSSNPASSLKSRVTFTRDTSKNEIYLQMTSMKSEDSGTYYCAR 

- GIYYNGNTDYATTLKGRLTLSRDTNKGEVYFKLTEAKTEESATYYCAR 

- VIASNGGTAFADSLKNRVTITRDTGKKQVYLQMNGMEVKDTAMYYCAR 

— YIRHDGGTTNYADSLKGRFTISRDSKNNKLYLQMNNLHTEDTAVYYCAR 

- RIDWDDDKYYSTSLKTRLTISKDTSKNQVVLTMTNMDPVDTATYYCAR 

-RTYYRSKWYNDYAVSLKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCAR 
— RIDPSDSYTNYSLSLKGHVTISADKSISTAYLQWSSLKASDTAMYYCAR 
HCASYWNRGWTYHNPSLKSRLTLALDTPKNLVFLKLNSVTAADTATYYCAR 


FIG. S21: Amino acid sequences of the frameworks - Multiple sequence alignment of the library-specific part of the frameworks
(Figure 1). The organism from which each sequence originates is indicated.




